Assignment 6 - EDA & FE - Healthcare Fraud Detection - PCA - Varadharajan Suresh vs2769

Introduction

The inpatientCharges.csv dataset is provided by the Centers for Medicare and Medicaid Services (CMS). It contains discharge counts and costs for each hospital across different states and geographical zones, for various Diagnosis Related Groups (DRGs).

We prepare and analyse the data to detect anomalies and identify any abuse of hospital resources for monetary gain, by benchmarking common practices and filtering records that fall outside acceptable standards.

Reading the 26 MB CSV file

Renaming columns for ease of access, and a glimpse into the data
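The loading-and-renaming step might look like the sketch below. The short column names in `RENAME_MAP` are assumptions for illustration; the actual mapping depends on the headers in inpatientCharges.csv.

```python
import pandas as pd

# Hypothetical mapping from the raw CMS headers to short names;
# the real headers in inpatientCharges.csv may differ.
RENAME_MAP = {
    "DRG Definition": "DRG",
    "Provider State": "State",
    "Provider City": "City",
}

def load_charges(path):
    """Read the CSV and shorten the column names for easier access."""
    df = pd.read_csv(path)
    return df.rename(columns=RENAME_MAP)
```

`df.head()` on the result then gives the glimpse into the data mentioned above.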

Column Names and Data types

About the data

Checking for missing values
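A missing-value check of this kind is usually a per-column count and percentage; a minimal sketch:

```python
import pandas as pd

def missing_report(df):
    """Count and percentage of missing values per column."""
    counts = df.isna().sum()
    pct = 100 * counts / len(df)
    return pd.DataFrame({"missing": counts, "pct": pct.round(2)})
```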

Section 1: EDA

Showing the Distribution of X

Section 1.1: Distribution plots

Count by State

Next we display the count of charges for the various geographic variables.

Count by city: Top 25
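Both geographic count plots can come from the same `value_counts` pattern; a sketch, assuming the renamed `State`/`City` columns:

```python
import pandas as pd

def counts_by(df, col, top=None):
    """Count rows per category, optionally keeping only the top-N
    categories (e.g. the top 25 cities). Chaining .plot(kind="bar")
    on the result would draw the bar chart."""
    counts = df[col].value_counts()
    return counts.head(top) if top is not None else counts
```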

Correlation Matrix

We examine the correlation between the variables and find that the charge columns are highly correlated.
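The correlation check is likely a `DataFrame.corr` over the numeric charge/payment columns; a sketch (the column list is whatever numeric columns the dataset exposes):

```python
import pandas as pd

def charge_correlations(df, cols):
    """Pairwise Pearson correlations of the numeric charge columns;
    a seaborn heatmap of this matrix gives the usual visual."""
    return df[cols].corr()
```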

Pairplot

Section 2 : FE

Feature 1

Calculating the average amount spent per DRG irrespective of state, then creating a second column with the ratio between the actual amount and that mean. This helps compare costs across states.
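This feature is the classic `groupby(...).transform("mean")` pattern; a sketch, assuming a payment column named `Average_Total_Payments` and the actual-over-mean direction for the ratio:

```python
import pandas as pd

def add_drg_ratio(df, value_col="Average_Total_Payments"):
    """Feature 1: nationwide mean payment per DRG, plus the ratio of
    each row's payment to that mean (ratio > 1 = above average)."""
    df = df.copy()
    df["Mean_Payment_per_DRG"] = df.groupby("DRG")[value_col].transform("mean")
    df["Payment_Ratio"] = df[value_col] / df["Mean_Payment_per_DRG"]
    return df
```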

Feature 2

Calculating the average amount spent per DRG and state, then creating a second column with the ratio between the actual amount and that mean. This helps compare costs within states.

Feature 3

Calculating the average amount spent per DRG and Hospital_referral_region_desp, then creating a second column with the ratio between the actual amount and that mean. This helps compare costs within each hospital referral region.

Feature 4

Calculating the average amount spent per DRG and city, then creating a second column with the ratio between the actual amount and that mean. This helps compare costs among cities within a state. We group by state and city when computing the average because different states may contain cities with the same name.
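The same-name-city concern above boils down to grouping on the compound key (DRG, State, City); a sketch with assumed column names:

```python
import pandas as pd

def add_city_ratio(df, value_col="Average_Total_Payments"):
    """Feature 4: mean per (DRG, State, City) so that cities sharing a
    name in different states are kept apart, plus the row ratio."""
    df = df.copy()
    grp = df.groupby(["DRG", "State", "City"])[value_col]
    df["Mean_per_DRG_City"] = grp.transform("mean")
    df["City_Ratio"] = df[value_col] / df["Mean_per_DRG_City"]
    return df
```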

Feature 5

Calculating the average amount spent per DRG and zip code, then creating a second column with the ratio between the actual amount and that mean. This helps compare costs within a zip code.

Feature 6

Calculating the average discharges per DRG irrespective of state, then creating a second column with the ratio between the actual discharges and that mean. This helps compare discharge counts across states.

Feature 7

Calculating the average discharges per DRG and state, then creating a second column with the ratio between the actual discharges and that mean. This helps compare discharge counts within states.

Feature 8

Calculating the average discharges per DRG and Hospital_referral_region_desp, then creating a second column with the ratio between the actual discharges and that mean. This helps compare discharge counts within each hospital referral region.

Feature 9

Calculating the average discharges per DRG and city, then creating a second column with the ratio between the actual discharges and that mean. This helps compare discharge counts among cities within a state. We group by state and city when computing the average because different states may contain cities with the same name.

Feature 10

Calculating the average discharges per DRG and zip code, then creating a second column with the ratio between the actual discharges and that mean. This helps compare discharge counts within a zip code.
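All ten features follow one pattern: a group mean of some value column plus a row-to-mean ratio. A single parameterized helper can generate them; column names here are assumptions:

```python
import pandas as pd

def add_ratio_feature(df, value_col, group_cols, prefix):
    """Generic version of features 1-10: mean of `value_col` within
    `group_cols`, and the ratio of each row's value to that mean."""
    df = df.copy()
    mean_col = f"{prefix}_mean"
    df[mean_col] = df.groupby(group_cols)[value_col].transform("mean")
    df[f"{prefix}_ratio"] = df[value_col] / df[mean_col]
    return df
```

For example, feature 7 would be `add_ratio_feature(df, "Total_Discharges", ["DRG", "State"], "drg_state_discharge")`.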

The newly created features

Hospitals with more than double the cost and double the discharge rate relative to the national average

Interpreting the newly created features and their charts

Average_Total_Payments

Discharges

Part 2: Unsupervised Learning

Section 1: PCA

Principal component analysis (PCA) can be used to detect outliers. PCA performs linear dimensionality reduction via Singular Value Decomposition of the data, projecting it onto a lower-dimensional space.

Splitting the data into train and test sets: 70% of the data is assigned to train and 30% to test.
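The 70/30 split is presumably scikit-learn's `train_test_split`; a sketch (the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

def split_70_30(X, seed=42):
    """Split features into 70% train / 30% test with a fixed seed
    for reproducibility."""
    return train_test_split(X, test_size=0.3, random_state=seed)
```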

Initializing PCA() and fitting it on the training data set

Analysing and identifying the largest outlier
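A common way to score outliers with PCA is reconstruction error: fit on the training split, project the test rows into the reduced space and back, and measure how far each row lands from its reconstruction. A sketch of that approach (the component count is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_outlier_scores(X_train, X_test, n_components=2):
    """Fit PCA on the training split and score each test row by its
    squared reconstruction error; the row with the largest score is
    the biggest outlier."""
    pca = PCA(n_components=n_components).fit(X_train)
    recon = pca.inverse_transform(pca.transform(X_test))
    return np.sum((X_test - recon) ** 2, axis=1)
```

`np.argmax(scores)` then points at the largest outlier.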

Section 2: Histogram based Outliers

HBOS is an unsupervised, histogram-based algorithm for capturing outliers. Because it assumes feature independence, it is fast and a suitable option for detecting global outliers.

Method 1: Average

Combination by average: test_scores_norm is 48920 x 10. The "average" function takes the average of the 10 detector columns, so the result "y_by_average" is a single column.

In our case it identifies 291 data points with outlier scores higher than 3. To get the summary statistics for each cluster, we run the following code, which produces the average values shown below.
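The averaging step itself is a column-wise mean (equivalent to pyod's `average` combination) followed by thresholding; a sketch, with the threshold of 3 taken from the text:

```python
import numpy as np

def combine_by_average(scores, threshold=3.0):
    """Average the per-detector scores row-wise and flag rows whose
    combined score exceeds the threshold."""
    combined = scores.mean(axis=1)
    return combined, combined > threshold
```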

Method 2: The Maximum of Maximum (MOM)

When we use the Maximum-of-Maximum method, we get 400 data points with outlier scores higher than 4. We use the following code to produce the summary statistics by cluster.
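Maximum-of-Maximum reduces each row to the single largest detector score (pyod exposes this as `maximization`); a sketch, with the threshold of 4 taken from the text:

```python
import numpy as np

def combine_by_max(scores, threshold=4.0):
    """Take the maximum of the per-detector scores row-wise and flag
    rows whose combined score exceeds the threshold."""
    combined = scores.max(axis=1)
    return combined, combined > threshold
```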

Method 3: The Average of Maximum (AOM)

When we use the Average-of-Maximum method, we get 258 data points with outlier scores higher than 0. We use the following code to produce the summary statistics by cluster.
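Average-of-Maximum splits the detectors into buckets, takes the maximum inside each bucket, then averages those maxima (pyod's `aom`); a sketch with an assumed bucket count of 5:

```python
import numpy as np

def combine_by_aom(scores, n_buckets=5):
    """AOM: max within each detector bucket, then the mean of those
    maxima, giving one combined score per row."""
    buckets = np.array_split(np.arange(scores.shape[1]), n_buckets)
    maxima = np.column_stack([scores[:, b].max(axis=1) for b in buckets])
    return maxima.mean(axis=1)
```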

Method 4: The Maximum of Average (MOA)

When we use the Maximum-of-Average method, we get 410 data points with outlier scores higher than 3.5. We use the following code to produce the summary statistics by cluster.
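Maximum-of-Average is the mirror image: average within each detector bucket, then take the largest bucket mean (pyod's `moa`); a sketch with the same assumed bucket count:

```python
import numpy as np

def combine_by_moa(scores, n_buckets=5):
    """MOA: mean within each detector bucket, then the max of those
    means, giving one combined score per row."""
    buckets = np.array_split(np.arange(scores.shape[1]), n_buckets)
    means = np.column_stack([scores[:, b].mean(axis=1) for b in buckets])
    return means.max(axis=1)
```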

Conclusions:

We have implemented two unsupervised models: PCA and HBOS.

Leveraging PCA, 100 charts were created; we observe that the largest outlier score is greater than 25,000 and identify 'ADCARE HOSPITAL OF WORCESTER INC' as the outlier.

From the HBOS analysis we find that over 7% of the data can be classified as outliers.